Genome wide association tests in GEMMA

Serena Caplins, Associate Bioinformatician

2023-06-15

Genome wide association study (GWAS)

In this training module we are going to show you how to perform a genome-wide association study, or GWAS. We’ll be using the program gemma: Genome-wide Efficient Mixed Model Association. There’s great documentation for gemma, which you can find here.

One of the secondary goals of this workshop is to keep (almost) nothing hidden from you! There are often additional file manipulations that happen to get data ready for a workshop; they make things a bit more streamlined, but in doing so they also obscure the real nature of bioinformatics. So in this workshop I will show you everything that has to happen to a vcf file to get it ready to run in gemma. Hopefully this will introduce you to a few tricks and maybe even some new tools.

Outline

  1. Brief introduction to a GWAS and why you might want to do one.
  2. Login to Discovery and install programs via Conda
  3. Download and prep the data
  4. Run Gemma and associated programs
  5. Launch RStudio via the OOD app to visualize results
  6. Practice questions
  7. End of session survey

Introduction to a GWAS

A genome-wide association study is a popular approach that evaluates the connection between genotype and phenotype. It is widely used across multiple fields: human health, human evolution, animal and plant breeding, the evolution of domestication, and more.

Image 1 credit

GWAS has also been used to evaluate the genetic basis of phenotypes involved in crop development and domestication.

Image 2 credit

Image 3 credit

Install packages using Conda

We’ll use Conda to install gemma and do most of our work in the terminal with a little at the end in RStudio to visualize our results.

Log in to Discovery using ssh and navigate to your /scratch or /work directory (if you have /work).

Now create a conda environment in your /scratch directory (note: if you’re interested in using gemma again after this tutorial and have a /work directory, you will want to create your environment there instead).

First create a new folder for this training session with mkdir

We’ll need plink to convert our vcf file into the binary format that gemma takes as input, beagle for imputation (which we’ll discuss below), and bcftools for some file manipulation prior to running gemma.


cd /scratch/s.caplins

mkdir training_gwas

# get on a compute node
srun --partition=reservation --reservation=bootcamp_cpu_2023 --pty /bin/bash

# load a conda module

module load anaconda3/2022.05

# create our environment
conda create --prefix=/scratch/<username>/training_gwas/gemma_env

# activate it
conda activate /scratch/<username>/training_gwas/gemma_env

# note: in order to get the preceding command to work you may need to run `conda init` first

#we need to add the bioconda channel to find and install the packages

conda config --add channels bioconda

# install gemma, plink, and bcftools (beagle will be downloaded separately below)
conda install gemma plink bcftools

Check that gemma is correctly installed


gemma -h 

Download the data

The data for today’s workshop come from a single population of sea slugs that exhibit variation in the type of egg mass they produce (a binary phenotype). The data have been thinned by LD (linkage disequilibrium) so that they are a manageable size for the workshop; you would normally want to run a GWAS on more SNPs than are present in this dataset. There are 85 individuals and ~29k SNPs.

Now that our environment is set up, we just need some data files to start working with.

The data for this workshop consist of three files: a vcf file, a metadata file, and a phenotype file. If you’re downloading these locally on a Mac, replace wget with curl -O.


# get the data
wget https://github.com/northeastern-rc/training_GEMMA/raw/main/rcbootcamp_gemmagwas.tar.gz

# untar the data
tar -xzvf rcbootcamp_gemmagwas.tar.gz

#check that the data is in training_gwas

cd training_gwas

ls

Imputation of the data

It is highly recommended to “impute” the genotype data prior to running the GWAS. What does impute mean?

Imputation infers missing genotypes from the genotypes observed at sampled sites and has been shown to improve the accuracy of a GWAS. Another way to read this is that GWAS is sensitive to missing data. So a good two-pronged approach is to determine how much missing data we have, remove samples that are mostly missing (>50%), and impute the rest. In this tutorial we will remove the samples with high missingness and then impute the remaining samples.

How do you know whether you have missing data or not?

We can use the command below, which uses bcftools to extract the sample names and genotypes, and then uses awk to count the missing genotypes (./.) for each sample and divide that count by the total number of sites, returning the proportion of missing data per sample. You will see that we have a number of samples with missingness greater than 50%!

paste <(bcftools query -f '[%SAMPLE\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2)
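To see what the awk portion of that one-liner is doing, here is a toy run on a tiny made-up genotype matrix (three samples as columns, four sites as rows; the data are hypothetical):

```shell
# Toy demonstration of the per-sample missingness count used above.
# Columns are samples, rows are sites; "./." marks a missing genotype.
missingness=$(printf '0/1\t./.\t1/1\n0/0\t./.\t0/1\n1/1\t0/1\t./.\n0/1\t./.\t1/1\n' |
  awk -v OFS="\t" '
    { for (i = 1; i <= NF; i++) if ($i == "./.") sum[i] += 1 }  # count "./." per column
    END { for (i in sum) print i, sum[i] / NR }                 # divide by number of sites
  ' | sort -k1,1n)
echo "$missingness"
```

Sample 2 is missing at 3 of 4 sites (0.75) and sample 3 at 1 of 4 (0.25); in the real command, paste lines these proportions up with the sample names extracted by the first bcftools query.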

You can find more helpful scripts like the one above here

Filter based on missingness

We can see that a few samples at the end there have a pretty high proportion of missing data. Let’s remove the individuals with more than 50% missing data from our analyses and impute the rest.
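Inside the removal command below, an awk filter turns the per-sample missingness table into an exclusion list. A toy sketch with made-up sample names:

```shell
# Toy sketch of building the exclusion list: keep only the names of samples
# whose missingness (column 2) exceeds 0.50. Sample names are hypothetical.
to_drop=$(printf 'S1\t0.10\nS2\t0.75\nS3\t0.62\nS4\t0.05\n' |
  awk '{ if ($2 > 0.50) print $1 }')
echo "$to_drop"
```

In the full command, this list is fed to bcftools view -S with a ^ prefix, which excludes rather than keeps the listed samples.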


bcftools view -S ^<(paste <(bcftools query -f '[%SAMPLE\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' 5.n190_moreSNPS_gwas_thinned.recode.vcf| awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2) | awk '{ if ($2 > 0.50) print $1 }') 5.n190_moreSNPS_gwas_thinned.recode.vcf| gzip > 5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz

Impute with beagle

Now we will impute the remaining samples, which have less missing data (some had very little). This takes less than 4 minutes; if we’re short on time we may skip to the next step. You would most likely run this on a larger dataset, and it would then take longer. For example, a test run on a dataset with >500k SNPs took ~20 minutes. You could run it interactively like this, but I put it in a batch script.


wget https://faculty.washington.edu/browning/beagle/beagle.27Jan18.7e1.jar

java -jar beagle.27Jan18.7e1.jar gt=5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz out=imputed_thinned_nomissing
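As mentioned above, for a larger dataset you may prefer to submit the imputation as a batch job. A minimal Slurm script might look like the following; the partition, time, and memory values are placeholders to adjust for your cluster:

```shell
# Write a hypothetical Slurm batch script for the beagle imputation step.
cat > run_beagle.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=8G
module load anaconda3/2022.05
java -jar beagle.27Jan18.7e1.jar gt=5.n190_moreSNPs_gwas_thinned_nomissing.vcf.gz out=imputed_thinned_nomissing
EOF
```

Submit it with sbatch run_beagle.sh.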

Run gemma

Now let’s run our genome-wide association tests. We will first compute a kinship matrix, and then we will use that kinship matrix in our GWAS. The -gk parameter tells gemma which type of relatedness matrix to calculate: a centered relatedness matrix (-gk 1, the default) or a standardized relatedness matrix (-gk 2).
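Note that -bfile plink assumes the imputed VCF has already been converted to PLINK binary format, with the phenotypes written into column 6 of the .fam file (the same plink and awk steps used in the practice-question solution at the end of this module). Here is a toy sketch of that awk phenotype merge, using hypothetical filenames and values:

```shell
# Toy sketch of merging a one-phenotype-per-line file into column 6 of a
# PLINK .fam file. Filenames and values are hypothetical.
tmp=$(mktemp -d)
printf '1\n0\n1\n' > "$tmp/pheno_bin"   # binary phenotype, one row per individual
printf 'F1 S1 0 0 0 -9\nF2 S2 0 0 0 -9\nF3 S3 0 0 0 -9\n' > "$tmp/plink.fam"
# First file: store each phenotype by line number; second file: rewrite column 6.
awk 'FNR==NR{a[NR]=$1; next}{$6=a[FNR]}1' "$tmp/pheno_bin" "$tmp/plink.fam" > "$tmp/test.fam"
cat "$tmp/test.fam"
# On the real data this would be preceded by something like:
#   plink --make-bed --vcf imputed_thinned_nomissing.vcf.gz --chr-set 95 --allow-extra-chr
```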


gemma -gk -bfile plink -p pheno_bin -o kinship

Now we will use the kinship matrix, our phenotype file, and our genotypes to run the GWAS. There are several other options at this step, including decomposing the kinship matrix into its eigenvalues and eigenvectors and supplying those to gemma directly. We could also add any number of other covariates, including principal components from a PCA to account for population structure, or covariates for possible batch effects (sequencing lane, library prep, or other technical artifacts). Please consult the gemma manual for more information.
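As a sketch of what a covariates file for gemma’s -c option could look like: gemma expects one row per individual, and the file should include a column of 1s for the intercept. The PC values and filenames below are hypothetical:

```shell
# Build a hypothetical covariates file for gemma's -c option: an intercept
# column of 1s followed by, e.g., the first two principal components.
tmp=$(mktemp -d)
printf '0.12 -0.30\n-0.05 0.22\n0.41 0.07\n' > "$tmp/pcs.txt"   # made-up PC1/PC2 values
awk '{print 1, $0}' "$tmp/pcs.txt" > "$tmp/covariates.txt"
cat "$tmp/covariates.txt"
# The association step might then become something like:
#   gemma -bfile plink -k output/kinship.cXX.txt -c covariates.txt -lmm 4 -o gwas_kinship_cov
```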


gemma -bfile plink -k output/kinship.cXX.txt -lmm 4 -o gwas_kinship

Check the output files in the output directory


ls output

Visualize our results

install packages in R

Navigate to the Open OnDemand app to access RStudio here. You will need to log in with your Northeastern account information.

We’ll be using the package qqman to make a manhattan plot.

install.packages("qqman", repos = "https://cloud.r-project.org")
library(qqman)

make a manhattan plot

Our GWAS results are contained within the file labeled: gwas_kinship.assoc.txt. We’ll start by reading that into R.

# change our working directory to our scratch folder

setwd("/scratch/<username>/training_gwas/")

gwas_df <- read.table("output/gwas_kinship.assoc.txt", header = TRUE)

Take a look at the dataframe. The test results are located in the last three columns: a Wald test (p_wald), a likelihood ratio test (p_lrt), and a score test (p_score).

head(gwas_df)
##   chr rs    ps n_miss allele1 allele0    af        beta         se   logl_H1
## 1   1  .  1360      0       G       T 0.191 -0.01038153 0.08048139 -31.08002
## 2  12  .  1762      0       T       C 0.086  0.03106614 0.11176640 -31.04891
## 3  13  .   266      0       A       C 0.026  0.17880600 0.20707290 -30.70759
## 4  13  .  3748      0       T       A 0.020  0.14090390 0.24282750 -30.91605
## 5  13  .  9158      0       A       C 0.013  0.19685070 0.28295870 -30.84084
## 6  13  . 11658      0       T       C 0.039  0.20810010 0.16631610 -30.29300
##   l_remle l_mle    p_wald     p_lrt   p_score
## 1   1e+05 1e+05 0.8977132 0.8959991 0.8963596
## 2   1e+05 1e+05 0.7818213 0.7782398 0.7790819
## 3   1e+05 1e+05 0.3906564 0.3827219 0.3867291
## 4   1e+05 1e+05 0.5634997 0.5569449 0.5591778
## 5   1e+05 1e+05 0.4888041 0.4815096 0.4844282
## 6   1e+05 1e+05 0.2147909 0.2071657 0.2134913
hist(gwas_df$p_lrt)

Everything looks good to proceed to making our manhattan plot.

The function manhattan requires each SNP to have its own “name”, so let’s make a vector of row numbers prefixed with the letter r. We should also correct for multiple testing, though you will see we don’t find much with this little dataset.

gwas_df$SNP <- paste0("r", 1:nrow(gwas_df))

# p_lrt is already a p-value, so we can adjust it directly for multiple testing
padj <- p.adjust(gwas_df$p_lrt, method = "BH")
alpha <- 0.05
outliers <- which(padj < alpha)
length(outliers)
## [1] 0
hist(padj)

manhattan(gwas_df, chr="chr", bp="ps", p="p_lrt")

Let’s look at a QQ plot of our p-values to check the model fit. qqman provides a qq() function that plots observed versus expected -log10 p-values:

qq(gwas_df$p_score)

Practice questions

  1. Test the impact of imputation and missing data by running the above analyses on the original vcf file 5.n190_moreSNPS_gwas_thinned.recode.vcf. How do the results differ from the imputed, missingness-filtered analysis?
Solution

mkdir test_imputation

cd test_imputation

cp ../pheno_bin .

plink --make-bed --vcf ../5.n190_moreSNPS_gwas_thinned.recode.vcf --chr-set 95 --allow-extra-chr

awk 'FNR==NR{a[NR]=$1;next}{$6=a[FNR]}1' pheno_bin plink.fam > test.fam

head *.fam

mv test.fam plink.fam

gemma -gk -bfile plink -p pheno_bin -o kinship

gemma -bfile plink -k output/kinship.cXX.txt -lmm 4 -o gwas_kinship
GEMMA 0.98.3 (2020-11-28) by Xiang Zhou and team (C) 2012-2020
Reading Files ...
## number of total individuals = 76
## number of analyzed individuals = 76
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    29821
## number of analyzed SNPs         =       39
Calculating Relatedness Matrix ...
================================================== 100%
**** INFO: Done.

That’s as far as we can get, because gemma only analyzed 39 of our ~29k SNPs!

  2. GWAS is sensitive to population structure, so it’s important to measure population structure and, if present, account for it in your association analyses. An alternative approach is to perform a GWAS within a single population, where you’re less likely to have genetic differentiation (driven by gene flow or genetic drift) confounded with your phenotype of interest.

Use a PCA to determine the underlying population structure via the tutorial from earlier today which you can find here

Solution

Coming soon!

End of session survey

Thank you so much for attending this session. Please take a few moments to fill out this survey so we can best meet your research computing needs!

Survey